NVIDIA Advances LLM Inference with Unified CPU-GPU Memory Architecture
NVIDIA's latest architectures target the growing memory demands of large language model inference. The Grace Blackwell and Grace Hopper designs feature NVLink-C2C, a 900 GB/s chip-to-chip interconnect that gives the CPU and GPU a single, coherent view of memory. This addresses a critical bottleneck in running models such as Llama 3 70B and Llama 4 Scout 109B: at two bytes per parameter, the 109-billion-parameter Scout needs roughly 218 GB for its weights alone in half precision (FP16/BF16).
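As a rough sanity check on those figures, half-precision weight memory is simply the parameter count times two bytes. The short sketch below is an illustrative estimate only, ignoring KV cache, activations, and framework overhead; it reproduces the numbers quoted above.

```python
def fp16_weight_gb(num_params: float) -> float:
    """Approximate weight footprint in GB at 2 bytes per parameter (FP16/BF16)."""
    return num_params * 2 / 1e9  # decimal gigabytes

for name, params in [("Llama 3 70B", 70e9), ("Llama 4 Scout 109B", 109e9)]:
    print(f"{name}: ~{fp16_weight_gb(params):.0f} GB of weights in half precision")
# Llama 3 70B: ~140 GB of weights in half precision
# Llama 4 Scout 109B: ~218 GB of weights in half precision
```

Both models therefore exceed the on-package memory of any single current GPU, which is what makes CPU memory an attractive overflow target.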
The unified memory architecture eliminates redundant host-to-device copies, which particularly benefits KV cache handling during inference: attention keys and values for past tokens can spill out of GPU memory yet remain coherently accessible. By letting GPU-constrained systems tap into CPU memory, NVIDIA effectively loosens the hardware requirements for cutting-edge AI workloads. The approach debuted in the GH200 Grace Hopper Superchip, which pairs 96 GB of high-bandwidth GPU memory (HBM3) with up to 480 GB of LPDDR5X CPU memory under system-wide coherence.
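One concrete way to use this kind of coherent CPU-GPU memory is CUDA managed (unified) memory, where allocations can grow beyond GPU capacity and pages resolve to CPU or GPU memory as needed. The sketch below routes CuPy allocations through the managed-memory allocator; it is a minimal illustration of the general mechanism under assumed placeholder sizes, not NVIDIA's inference stack, and the tensor shape is purely hypothetical.

```python
import cupy as cp

# Route all CuPy allocations through cudaMallocManaged so they live in a
# single CPU+GPU address space; on Grace Hopper/Blackwell the NVLink-C2C
# link keeps that shared space coherent at up to 900 GB/s.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

# Placeholder KV-cache-like tensor (layers x sequence positions x head dim).
# In practice such an allocation can be sized beyond free GPU memory; pages
# then back onto CPU memory instead of triggering an out-of-memory failure.
kv_cache = cp.zeros((64, 8192, 128), dtype=cp.float16)

# GPU kernels operate on the buffer directly, with no explicit host-device copies.
kv_cache += 1.0
print(kv_cache.nbytes / 1e9, "GB allocated in unified memory")
```

On systems without hardware coherence, the same code still runs, but pages migrate over PCIe rather than being accessed in place, which is where the bandwidth advantage of NVLink-C2C shows up.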